

Results on FAVOR Bench

Neural Information Processing Systems

Prompt Template: Generating QA Pairs for Camera Motion (CM) Task

You are a professional question designer focusing on temporal dynamics in videos, including camera movements, motions, activities, and interactions, rather than static content. You will receive detailed annotations about the temporal details of the entire video, with duration markers in parentheses after "camera_motion" and "motion_list". Based on these annotations, design 3 multiple-choice questions around the "Camera Motion" theme to test models' fine-grained video motion understanding, particularly: Understanding camera movement direction and focus changes in the video. Additionally, follow these question design guidelines:

1. If a video's "camera_motion" has only one element, such as "camera_motion": "static", or "camera_motion": "camera shaking (0-22)", skip this video and don't generate any content.
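Guideline 1 in the template above amounts to a pre-filter on the annotations, and it could be applied before the prompt is ever sent. The following is a minimal sketch in Python, not the benchmark authors' code. The "camera_motion" key comes from the template; the function name `should_generate_cm_questions` is hypothetical, and the assumption that multiple elements arrive either as a list or as a comma-separated string is ours, since the excerpt does not show a multi-element example.

```python
def should_generate_cm_questions(annotation: dict) -> bool:
    """Return True when the video has more than one camera-motion element.

    Hypothetical helper: the separator between elements is an assumption,
    because the source only shows single-element values.
    """
    motion = annotation.get("camera_motion", "")
    if isinstance(motion, str):
        # Assumed format: "pan left (0-5), zoom in (5-10)"
        elements = [part.strip() for part in motion.split(",") if part.strip()]
    else:
        # Assumed format: ["pan left (0-5)", "zoom in (5-10)"]
        elements = [str(part).strip() for part in motion if str(part).strip()]
    return len(elements) > 1


# Single-element annotations, as in the template's own examples, are skipped.
videos = [
    {"camera_motion": "static"},
    {"camera_motion": "camera shaking (0-22)"},
    {"camera_motion": "pan left (0-5), zoom in (5-10)"},  # made-up example
]
kept = [v for v in videos if should_generate_cm_questions(v)]
```

Filtering up front would avoid spending a model call on videos the template says to skip, though the template as written leaves that decision to the model.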


BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Neural Information Processing Systems

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different domains. However, finding suitable benchmarks is difficult given the many available options. This complexity not only increases the risk of benchmark misuse and misinterpretation but also demands substantial effort from LLM users seeking the most suitable benchmarks for their specific needs. To address these issues, we introduce BenchmarkCards, an intuitive and validated documentation framework that standardizes critical benchmark attributes such as objectives, methodologies, data sources, and limitations. Through user studies involving benchmark creators and users, we show that BenchmarkCards can simplify benchmark selection and enhance transparency, facilitating informed decision-making in evaluating LLMs.


Direct Alignment with Heterogeneous Preferences

Neural Information Processing Systems

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
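One way to read the abstract's "average reward across user types" is sketched below. The notation is assumed for illustration and is not taken from the paper: $\mathcal{U}$ is the set of user types, $p(u)$ the share of type $u$ in the population, $r_u$ that type's reward, $x$ a prompt, and $y$ a response. Any regularization term the paper may include is omitted.

```latex
\bar{r}(x, y) \;=\; \sum_{u \in \mathcal{U}} p(u)\, r_u(x, y),
\qquad
\pi^{\star} \;\in\; \arg\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\!\left[\, \bar{r}(x, y) \,\right].
```

Under this reading, computing $\bar{r}$ requires knowing which type produced each annotation, or at least the shares $p(u)$, which is consistent with the abstract's remark that the approach needs additional information about annotators.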


Epic Games details how it's embracing generative AI in Unreal Engine

Engadget

Just over half of game developers think gen AI is bad for the industry, according to a report published earlier this year. During The State of Unreal keynote at Unreal Fest on Wednesday, Epic Games revealed just how it's embracing generative AI in Unreal Engine (UE). Along with offering the first details on Unreal Engine 6 (UE6), the company discussed new features for Unreal Engine 5.8, which it also released on Wednesday. As part of the latest update, Epic is offering an experimental Model Context Protocol (MCP) plugin that will allow developers to hook gen AI models such as Claude and Gemini into Unreal Engine. It's looking to make the MCP an integral part of UE6.


Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Neural Information Processing Systems

Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies.


Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Neural Information Processing Systems

Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics--prompt-to-line consistency, line-to-line consistency, and Q&A consistency--that capture different types of persona drift and validate each against human annotations. Using these metrics as reward signals, we apply multiturn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent, faithful, and trustworthy simulated users.


Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering

Neural Information Processing Systems

In this paper, we explore semantic clustering properties of deep reinforcement learning (DRL) to improve its interpretability and deepen our understanding of its internal semantic organization. In this context, semantic clustering refers to the ability of neural networks to cluster inputs based on their semantic similarity in the feature space. We propose a DRL architecture that incorporates a novel semantic clustering module that combines feature dimensionality reduction with online clustering.


Social networks, online video outweigh traditional media in 2026

The Japan Times

News consumers around the world are now turning more to social media and video platforms than traditional outlets for information, a report said Tuesday, warning that old-style business models are under threat. The year 2026 marks "a significant milestone: for the first time, social media and video network consumption is now ahead of other news sources as the most widely used source of news globally," at 54%, wrote Jim Egan, lead author of the report from the Reuters Institute for the Study of Journalism. The annual report from the institute, attached to the University of Oxford, is a closely-watched tracker of trends reshaping the news media. Researchers based their findings on online surveys of almost 100,000 people in 48 countries, run earlier this year by pollster YouGov. This year's edition found 54% of respondents said they got news from social media or video platforms in the week before the survey -- rising to 56% if AI chatbots like ChatGPT were included.


CGBENCH: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Neural Information Processing Systems

Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBENCH, a robust benchmark that tests reasoning capabilities of LMs on scientific publications.


Baltic states fear Russia-Ukraine war spillover after drone incursions

Al Jazeera

Recent incidents heighten anxieties that hybrid warfare tactics could trigger military confrontation with Russia. Along the forests and marshlands that separate the Baltic states from Russia and Belarus, workers are digging anti-tank ditches, pouring concrete bunkers and erecting rows of dragon's teeth - jagged concrete obstacles designed to slow and channel advancing armour - to buy precious time in the event of an attack. Russia's full-scale invasion of Ukraine in 2022 reignited old fears in Estonia, Latvia and Lithuania, where memories of Soviet rule remain close to the surface. In the years since, those fears have been channelled into preparation. Defence budgets have surged, military exercises have intensified, and new fortifications have emerged even as daily life largely continues as normal.